## [1] 1599
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.factor
## Min. : 8.40 Min. :3.000 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53
## Median :10.20 Median :6.000 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
I’ll start by plotting a histogram of the quality variable to see how it is distributed.
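A minimal sketch of that first plot in base R, built from the per-score counts printed in the summary output above. The original plotting code is not shown, so the use of `barplot()` and the names `quality_counts` and `share_5_6` are assumptions made for illustration.

```r
# Counts per quality score, copied from the summary() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# For an integer score, a bar chart of counts shows the same thing as a histogram.
barplot(quality_counts,
        xlab = "quality (score)",
        ylab = "number of wines",
        main = "Distribution of wine quality")

# Share of wines that were scored 5 or 6.
share_5_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
round(share_5_6, 3)
```

The counts add up to the 1599 rows reported at the top, and the two middle scores account for roughly 82% of the wines.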
With that histogram in hand, I’ll plot histograms of the other variables in the dataset to see which ones have a distribution that resembles it.
Judging from the histograms above, the variables most likely to have some effect on, or correlation with, the quality of the wine are fixed.acidity, volatile.acidity, pH and density. I’ll dig into this further in the Bivariate Plots Section.
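This dataset has 1599 entries for red Portuguese “Vinho Verde” wines, described by the 12 variables listed below. (The structure printed above shows 14 columns because it also includes the row index X and the derived quality.factor.)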
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)
The quality variable is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent), so quality is a qualitative (ordered categorical) variable. The other variables are the results of objective tests (e.g. pH values).
The main feature of this dataset is the quality of the wine, and I’m particularly interested in finding which variable(s) influenced the quality of those wines the most.
From the graphs plotted above, I think that fixed.acidity, volatile.acidity, pH and density are good candidates for supporting this investigation.
Yes, I created the variable quality.factor, which is the quality variable cast to a factor. That may help if I want to plot boxplots with the quality variable on the x axis.
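As a hedged sketch of how such a factor can be built in base R, the snippet below uses the first ten quality scores shown in the structure output above. The exact call used in the original analysis is not shown, so passing `levels = 3:8` explicitly is an assumption; on the full dataset, where every score from 3 to 8 occurs, a plain `factor(quality)` would give the same six levels.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fixing the levels to 3..8 (the observed range in the full data) keeps the
# level set stable even though this small sample only contains 5, 6 and 7.
quality.factor <- factor(quality, levels = 3:8)

levels(quality.factor)      # "3" "4" "5" "6" "7" "8"
as.integer(quality.factor)  # 3 3 3 4 3 3 3 5 5 3, the codes shown by str()
```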
I didn’t see any unusual distributions, but I did see some variables with very skewed histograms, such as residual.sugar (maximum 15.5 against a third quartile of 2.6) and chlorides (maximum 0.611 against a third quartile of 0.09).
Below I’m going to draw scatter plots with regression lines fitted by a linear model and, to complement them, boxplots grouped by quality.factor.
The objective here is to see how each of the variables in the dataset relates to quality.
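The two kinds of plot described here can be sketched in base R as follows. This is only an illustration on the first ten rows printed in the structure output above, with alcohol as the example variable; the plotting library and options used for the real figures are not shown, and `wine_head` and `fit` are names introduced for this sketch.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine_head$quality.factor <- factor(wine_head$quality, levels = 3:8)

# Scatter plot with a straight line fitted by ordinary least squares.
fit <- lm(quality ~ alcohol, data = wine_head)
plot(quality ~ alcohol, data = wine_head,
     xlab = "alcohol (% by volume)", ylab = "quality (score)")
abline(fit, col = "red")

# Boxplot of the same variable, grouped by the factor version of quality.
boxplot(alcohol ~ quality.factor, data = wine_head,
        xlab = "quality (score)", ylab = "alcohol (% by volume)")
```

Ten rows are far too few to say anything about the wines; the point is only the shape of the calls, which can be repeated for each variable in turn.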
Analyzing the plots above, I noticed that quality does not correlate with most of the other variables: fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and pH.
The plots of volatile.acidity, citric.acid, density, sulphates and alcohol show that these variables do correlate with quality, with alcohol being the one that correlates with it the most.
I decided to draw scatter plots of the relationship between quality, alcohol and the other variables with strong relationships with quality: volatile.acidity, citric.acid, density and sulphates.
And isn’t it beautiful how those plots show exactly how they relate with each other? :)
They confirm the correlation numbers; some plots are denser, others are more spread out (e.g. citric.acid). Some have more outliers than others, of course, but they confirm the correlations listed in the Bivariate Analysis Section and show graphically how the variables relate to each other and to quality.
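One way to draw a plot of this kind in base R is to put two chemical variables on the axes and colour each point by its quality score. The sketch below does that for alcohol and volatile.acidity using the first ten rows from the structure output above. It is an assumption about the general approach, not the code behind the figures, and `wine_head`, `palette_q` and `point_cols` are illustrative names.

```r
# First ten rows, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
quality.factor <- factor(wine_head$quality, levels = 3:8)

# One colour per quality level; each point takes the colour of its level.
palette_q  <- heat.colors(nlevels(quality.factor))
point_cols <- palette_q[as.integer(quality.factor)]

plot(volatile.acidity ~ alcohol, data = wine_head,
     col = point_cols, pch = 19,
     xlab = "alcohol (% by volume)",
     ylab = "volatile acidity (g / dm^3)")
legend("topright", legend = levels(quality.factor),
       col = palette_q, pch = 19, title = "quality")
```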
Even though the quality variable is stored as a number in the dataset, by definition it is a categorical variable, so this distribution cannot be called normal, nor can we, strictly speaking, calculate a correlation between quality and the other variables. But we can say that most of the wines are of quality 5 or 6 (681 and 638 wines respectively), while 3 and 8 (the worst and best wines respectively) have the lowest counts (10 and 18).
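Because the quality scores are ordered, a rank-based measure such as Spearman’s rho is one option for putting a number on an association without treating the score as a continuous measurement. This is a suggestion rather than something done in the analysis above; the sketch uses base R’s `cor()` on the first ten rows from the structure output, and the name `rho` is introduced here.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Spearman's rho works on ranks, so only the ordering of the scores matters.
# Tied values (common with integer scores) receive average ranks.
rho <- cor(alcohol, quality, method = "spearman")
rho
```

With only ten rows the value itself carries no weight; on the full data the same call would be applied to the complete columns.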
Those plots show a strong relationship between the alcohol variable and quality. All three views (the scatter plot, the linear regression line and the boxplots) tell the same story: the higher the % of alcohol in the wine, the better the perceived quality.
Those 4 scatter plots show how alcohol, quality and 4 other variables relate to each other. For example, a red wine with a low % of alcohol and low volatile.acidity is far more likely to be rated as low quality than a wine with a high % of alcohol and high volatile.acidity.
I’ve found that those 5 variables (alcohol, volatile.acidity, citric.acid, density and sulphates) have a strong correlation with the wine’s quality score. To take this analysis a step further, I would try to build a regression model that uses those 5 variables to predict a wine’s quality.
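A rough sketch of what that next step could look like with base R’s `lm()` is shown below. The helper `fit_quality_model()` and the data frame `wine_sim` are hypothetical. Because the full dataset is not reproduced here, `wine_sim` is filled with random placeholder values drawn within the minimum and maximum ranges from the summary output, so the fitted coefficients mean nothing; only the model formula is the point.

```r
set.seed(42)
n <- 200

# PLACEHOLDER data: random values within the min/max ranges from the summary
# above. These are NOT real wine measurements and quality is pure noise.
wine_sim <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  citric.acid      = runif(n, 0, 1),
  density          = runif(n, 0.9901, 1.0037),
  sulphates        = runif(n, 0.33, 2),
  quality          = sample(3:8, n, replace = TRUE)
)

# Hypothetical helper: linear regression of quality on the five variables.
fit_quality_model <- function(wine) {
  lm(quality ~ alcohol + volatile.acidity + citric.acid + density + sulphates,
     data = wine)
}

model <- fit_quality_model(wine_sim)
names(coef(model))  # intercept plus one coefficient per predictor
```

With the real data frame in place of `wine_sim`, `summary(model)` would report how much of the variation in quality the five variables explain. A linear model treats quality as a number, which is a simplification given that the score is an ordered category.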